Wrangling data in R

Leonard Blaschek

“80% of data analysis is data cleaning”

— Ancient wisdom

A quick word about myself

R fundamentals

ggplot()            # function
ggplot              # object
996107              # number
"ggplot"            # string
?ggplot()           # show help page 
library(tidyverse)  # use library() to load the tidyverse package

?read_tsv()

?readr # navigate to package index and then vignettes

Types of data

  1. Excel sheets
  2. Delimited text files
  3. Folders of raw data
  4. Insane, lawless text files
  5. Proprietary formats

Excel sheets

Tidy Data¹

  • No white space
  • One observation per row
  • One variable per column
  • No information in formatting



Spot the untidiness!

Delimited text files

CSV

CSV2

TSV
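
The three formats differ only in their delimiter and decimal mark, and readr has a matching reader for each. A minimal sketch, assuming readr is installed; the file contents are made up for illustration and written to temporary files so the example runs on its own:

```r
library(readr)

# Hypothetical example data, written to temporary files
csv_file  <- tempfile(fileext = ".csv")
csv2_file <- tempfile(fileext = ".csv")
tsv_file  <- tempfile(fileext = ".tsv")

writeLines(c("sample,height", "a,1.5", "b,2.25"), csv_file)     # comma-delimited, decimal point
writeLines(c("sample;height", "a;1,5", "b;2,25"), csv2_file)    # semicolon-delimited, decimal comma
writeLines(c("sample\theight", "a\t1.5", "b\t2.25"), tsv_file)  # tab-delimited, decimal point

# One reader per format; all three should yield the same table
csv  <- read_csv(csv_file, show_col_types = FALSE)
csv2 <- read_csv2(csv2_file, show_col_types = FALSE)
tsv  <- read_tsv(tsv_file, show_col_types = FALSE)
```

If a file uses some other delimiter, read_delim() with an explicit delim argument is the general form of these readers.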

Folders of files

list.files()

Create a read-in function
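
One way to combine list.files() with a read-in function is sketched below. The folder, the file names, the column names and the helper read_plate() are all hypothetical; the pattern is what matters: list the files, read each one with the same function, then stack the results.

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical folder of raw data: three small TSV files in a temporary directory
data_dir <- file.path(tempdir(), "raw_data_example")
dir.create(data_dir, showWarnings = FALSE)
for (id in c("plate1", "plate2", "plate3")) {
  writeLines(
    c("well\tvalue", "A1\t0.1", "A2\t0.2"),
    file.path(data_dir, paste0(id, ".tsv"))
  )
}

# Find every .tsv file in the folder, with full paths
files <- list.files(data_dir, pattern = "\\.tsv$", full.names = TRUE)

# Read-in function: read one file and record which file it came from
read_plate <- function(path) {
  data <- read_tsv(path, show_col_types = FALSE)
  data$file <- basename(path)
  data
}

# Apply the function to every file and stack the results into one table
all_plates <- bind_rows(map(files, read_plate))
```

Keeping the file name as a column means information encoded in the name (date, plate, treatment) can be recovered later.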

Data cleaning

Missing values
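
Missing values are often coded as text, which turns a numeric column into a character column on read-in. A small sketch, assuming a made-up file in which "-" and "n.d." stand for missing measurements:

```r
library(readr)
library(tidyr)

# Hypothetical file with two different codes for missing values
raw_file <- tempfile(fileext = ".csv")
writeLines(c("sample,height", "a,1.5", "b,-", "c,n.d.", "d,3"), raw_file)

# Declare the missing-value codes on read-in so the column stays numeric
heights <- read_csv(raw_file, na = c("", "NA", "-", "n.d."), show_col_types = FALSE)

# Either drop rows with a missing height ...
complete <- drop_na(heights, height)

# ... or keep them and tell summary functions to skip NA
mean_height <- mean(heights$height, na.rm = TRUE)
```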

Long and wide data
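
Switching between the two layouts can be sketched with tidyr's pivot functions. The table below is invented for illustration: the wide version has one column per measurement day, the long version has one row per measurement.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one column per measurement day
wide <- tribble(
  ~sample, ~day1, ~day2,
  "a",     1.0,   1.4,
  "b",     2.0,   2.6
)

# Wide to long: column names become values of "day", cell values go to "height"
long <- pivot_longer(wide, cols = c(day1, day2), names_to = "day", values_to = "height")

# Long to wide: the reverse operation
wide_again <- pivot_wider(long, names_from = day, values_from = height)
```

The long layout is the one that matches the tidy data rules (one observation per row) and is what ggplot2 and grouped summaries usually expect.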

Separating compound variables
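
A compound variable packs several pieces of information into one column, for example genotype and replicate in a sample ID. A minimal sketch with made-up IDs, using tidyr's separate() (newer tidyr versions also offer separate_wider_delim() for the same job):

```r
library(tibble)
library(tidyr)

# Hypothetical sample IDs that combine genotype and replicate
samples <- tibble(id = c("WT_rep1", "WT_rep2", "mutant_rep1"))

# Split the ID at the underscore into two separate variables
separated <- separate(samples, id, into = c("genotype", "replicate"), sep = "_")
```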

Correcting data classes
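
Columns frequently arrive as character even when they hold numbers, dates or categories. The sketch below converts each column to a more fitting class; the data and the chosen factor levels are assumptions for illustration.

```r
library(tibble)
library(dplyr)
library(readr)

# Hypothetical raw table in which everything was read as text
raw <- tibble(
  genotype = c("WT", "mutant", "WT"),
  height   = c("1.5", "2.25", "3 cm"),
  date     = c("2023-01-10", "2023-01-11", "2023-01-12")
)

clean <- raw %>%
  mutate(
    genotype = factor(genotype, levels = c("WT", "mutant")),  # category with a fixed order
    height   = parse_number(height),                          # drops the stray unit in "3 cm"
    date     = as.Date(date)                                  # ISO dates parse directly
  )
```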

Data analysis

Grouping

Mutate

Summarise
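
How the three verbs fit together can be sketched on a small invented table: group_by() defines the groups, mutate() adds a column while keeping every row, and summarise() collapses each group to a single row.

```r
library(tibble)
library(dplyr)

# Hypothetical measurements for two genotypes
heights <- tibble(
  genotype = c("WT", "WT", "mutant", "mutant"),
  height   = c(2, 4, 1, 2)
)

# mutate() on grouped data: the mean is computed within each genotype
relative <- heights %>%
  group_by(genotype) %>%
  mutate(rel_height = height / mean(height)) %>%
  ungroup()

# summarise(): one row per genotype
summary_table <- heights %>%
  group_by(genotype) %>%
  summarise(mean_height = mean(height), n = n(), .groups = "drop")
```

Calling ungroup() (or using .groups = "drop") avoids surprises when later steps are not meant to run per group.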

purrr
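
purrr's map functions apply one function to every element of a list or vector. A minimal sketch with toy inputs:

```r
library(purrr)

# map() applies a function to each element and always returns a list
squares_list <- map(1:3, function(x) x^2)

# Typed variants return an atomic vector of the stated type instead
squares <- map_dbl(1:3, function(x) x^2)

# map2() variants iterate over two inputs in parallel
labels <- map2_chr(c("a", "b"), c(1, 2), function(name, n) paste0(name, n))
```

The same pattern scales from toy vectors to lists of file paths or data frames, which is why it pairs well with a read-in function.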

When you’re stuck

  1. Know which package/function you need? Read its help pages and vignettes!
  2. Know what you want to do but not where to start? Try an LLM, e.g. perplexity.ai
  3. Feel like you've done this before? Keep your old scripts organised and annotated; chances are you'll need that little hack you came up with again in a month or two.

Exercises!

Open up 2023_ggplot2_exercises.rmd and give it a try

Resources to go further